Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, telephone switching centers, 911 systems, and other similar commercial transaction processing applications requiring maximum uptime and no data loss. The company was founded by Jimmy Treybig in 1974 in Cupertino, California. It remained independent until 1997, when it became a server division within Compaq. It is now a server division within Hewlett Packard Enterprise, following Hewlett-Packard's 2002 acquisition of Compaq and its 2015 split into HP Inc. and Hewlett Packard Enterprise.
Tandem's NonStop systems use a number of independent identical processors, redundant storage devices, and redundant controllers to provide automatic high-speed "failover" in the case of a hardware or software failure. To contain the scope of failures and of corrupted data, these multi-computer systems have no shared central components, not even main memory. Conventional multi-computer systems all use shared memories and work directly on shared data objects. Instead, NonStop processors cooperate by exchanging messages across a reliable fabric, and software takes periodic snapshots for possible rollback of program memory state.
Besides masking failures, this "shared-nothing" messaging system design also scales to the largest commercial workloads. Each doubling of the total number of processors doubles system throughput, up to the maximum configuration of 4000 processors. In contrast, the performance of conventional multiprocessor systems is limited by the speed of some shared memory, bus, or switch. Adding more than 4–8 processors in that manner gives no further system speedup. NonStop systems have more often been bought to meet scaling requirements than for extreme fault tolerance. They compete against IBM's largest mainframes, despite being built from simpler minicomputer technology.
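The scaling contrast described above can be illustrated with a deliberately idealized toy model. The function names, the per-processor rate, and the shared-resource capacity below are hypothetical values chosen for illustration, not measurements of any Tandem or competing system, and the model ignores messaging overhead.

```python
def shared_nothing_throughput(cpus: int, per_cpu_tps: float) -> float:
    """Idealized shared-nothing system: throughput grows linearly with CPUs."""
    return cpus * per_cpu_tps


def shared_resource_throughput(
    cpus: int, per_cpu_tps: float, shared_capacity_tps: float
) -> float:
    """Idealized shared-memory system: throughput is capped by the capacity
    of the shared memory, bus, or switch (an assumed figure)."""
    return min(cpus * per_cpu_tps, shared_capacity_tps)


if __name__ == "__main__":
    # Assumed figures: 100 transactions/s per CPU, shared resource caps at 600.
    for cpus in (1, 2, 4, 8, 16):
        linear = shared_nothing_throughput(cpus, 100.0)
        capped = shared_resource_throughput(cpus, 100.0, 600.0)
        print(f"{cpus:2d} CPUs: shared-nothing={linear:7.1f}  shared={capped:7.1f}")
```

With these assumed numbers, the shared-resource curve flattens once the processors together demand more than the shared component can deliver, while the shared-nothing curve keeps doubling.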
Each engineer was confident they could quickly pull off their own part of this complex new design but doubted that others' areas could be worked out. The parts of the hardware and software design that did not have to be different were largely based on incremental improvements to the familiar hardware and software designs of the HP 3000. Many subsequent engineers and programmers also came from HP. Tandem headquarters in Cupertino, California, were a quarter mile away from the HP offices. Initial venture capital investment in Tandem Computers came from Tom Perkins, who was formerly a general manager of the HP 3000 division.
The business plan included detailed ideas for building a unique corporate culture reflecting Treybig's values.
The design of the initial Tandem/16 hardware was completed in 1975, and the first system shipped to Citibank in May 1976.
The company enjoyed uninterrupted exponential growth through 1983. Inc. magazine ranked Tandem as the fastest-growing public company in America. By 1996, Tandem was a $2.3 billion company employing approximately 8,000 people worldwide.
While conventional systems of the era, including large mainframes, had mean-time-between-failures (MTBF) on the order of a few days, the NonStop system was designed for failure intervals 100 times longer, with uptimes measured in years. Nevertheless, the NonStop was designed to be price-competitive with conventional systems, with a simple 2-CPU system priced at just over twice the price of a competing single-processor mainframe, as opposed to four or more times for other fault-tolerant solutions.
Besides recovering well from failed parts, the T/16 was also designed to detect as many kinds of intermittent failures as possible, as soon as possible. This prompt detection is called "fail fast". The point was to find and isolate corrupted data before it was permanently written into databases and other disk files. In the T/16, error detection was done by added custom circuits that added little cost to the total design; no major parts were duplicated to get error detection.
The T/16 CPU was a proprietary design, greatly influenced by the HP 3000 minicomputer. Both were microcoded, 16-bit stack machines with segmented, 16-bit virtual addressing. Both were intended to be programmed exclusively in high-level languages, with no use of assembler. Both were initially implemented via standard low-density TTL chips, each holding a 4-bit slice of the 16-bit ALU. Both had a small number of top-of-stack, 16-bit data registers plus some extra address registers for accessing the memory stack. Both used Huffman encoding of operand address offsets, to fit a large variety of address modes and offset sizes into the 16-bit instruction format with good code density. Both relied heavily on pools of indirect addresses to overcome the short instruction format. Both supported larger 32- and 64-bit operands via multiple ALU cycles, and memory-to-memory string operations. Both used "big-endian" addressing of long versus short memory operands. These features had all been inspired by Burroughs B5500–B6800 mainframe stack machines.
The T/16 instruction set changed several features from the HP 3000 design. The T/16 supported paged virtual memory from the beginning. The HP 3000 series did not add paging until the PA-RISC generation, 10 years later (although via MPE V it had a form of paging using the APL firmware, in 1978). Tandem added support for 32-bit addressing in its second machine; HP 3000 lacked this until its PA-RISC generation. Paging and long addresses were critical for supporting complex system software and large applications. The T/16 treated its top-of-stack registers in a novel way; the compiler, not the microcode, was responsible for deciding when full registers were spilled to the memory stack and when empty registers were re-filled from the memory stack. On the HP 3000, this decision took extra microcode cycles in every instruction. The HP 3000 supported COBOL with several instructions for calculating directly on arbitrary-length BCD (binary-coded decimal) strings of digits. The T/16 simplified this to single instructions for converting between BCD strings and 64-bit binary integers.
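The BCD-to-binary conversion mentioned above can be sketched in a few lines. This is a minimal illustration, not the T/16 instruction semantics: it assumes unsigned packed BCD (two decimal digits per byte, most significant first) and a signed 64-bit result range, and the function names are hypothetical.

```python
def bcd_to_int(bcd: bytes) -> int:
    """Convert unsigned packed BCD (assumed layout) to a binary integer."""
    value = 0
    for byte in bcd:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError(f"invalid BCD byte: {byte:#04x}")
        value = value * 100 + high * 10 + low
    if value >= 1 << 63:
        # Assumption: the result must fit a signed 64-bit integer.
        raise OverflowError("value does not fit in a signed 64-bit integer")
    return value


def int_to_bcd(value: int, num_bytes: int) -> bytes:
    """Convert a non-negative integer to packed BCD of a fixed byte length."""
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    out = bytearray(num_bytes)
    for i in range(num_bytes - 1, -1, -1):
        value, pair = divmod(value, 100)  # peel off two decimal digits
        out[i] = ((pair // 10) << 4) | (pair % 10)
    if value:
        raise OverflowError("value needs more BCD digits than num_bytes allows")
    return bytes(out)
```

Once a decimal field has been converted this way, arithmetic can proceed with ordinary binary integer instructions, and the result is converted back to BCD only when it is stored.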
In the T/16, each CPU consisted of two boards of TTL logic and SRAMs, and ran at about 0.7 MIPS. At any instant, it could access only four virtual memory segments (System Data, System Code, User Data, User Code), each limited to 128 KB in size. The 16-bit address spaces were already small for major applications when it shipped.
The first release of T/16 had only a single programming language, Transaction Application Language (TAL). This was an efficient machine-dependent systems programming language (for operating systems, compilers, etc.) but could also be used for non-portable applications. It was derived from HP 3000's System Programming Language (SPL). Both had semantics similar to C but a syntax based on Burroughs' ALGOL. Subsequent releases added support for COBOL 74, BASIC, Fortran, Java, C, C++, and MUMPS.
The Tandem NonStop series ran a custom operating system which was significantly different from Unix or HP 3000's MPE. It was initially called T/TOS (Tandem Transactional Operating System) but was soon named Guardian for its ability to protect all data from machine faults and software faults. In contrast to all other commercial operating systems, Guardian was based on message passing as the basic way for all processes to interact, without shared memory, regardless of where the processes were running. This approach easily scaled to multiple-computer clusters and helped isolate corrupted data before it propagated.
All file system processes and all transactional application processes were structured as master/slave pairs of processes running in separate CPUs. The slave process periodically took snapshots of the master's memory state and took over the workload if and when the master process ran into trouble. This allowed the application to survive failures in any CPU or its associated devices, without data loss. It further allowed recovery from some intermittent-style software failures. Between failures, the monitoring by the slave process added some performance overhead but this was far less than the 100% duplication in other system designs. Some major early applications were directly coded in this checkpoint style, but most instead used various Tandem software layers which hid the details of this in a semi-portable way.
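The checkpoint-and-takeover pattern described above can be sketched as a small simulation. This is an illustrative model only, not Guardian's API: the class names, the checkpoint interval, and the assumption that unprocessed requests are still available for replay after a failure are all hypothetical. The code uses the names "primary" and "backup" for the two members of the pair.

```python
import copy


class BackupProcess:
    """Holds the most recent snapshot of the primary's state."""

    def __init__(self):
        self.snapshot = None

    def receive_checkpoint(self, state):
        # Copy the state so later changes in the primary cannot leak in.
        self.snapshot = copy.deepcopy(state)

    def take_over(self):
        if self.snapshot is None:
            raise RuntimeError("no checkpoint received yet")
        return copy.deepcopy(self.snapshot)


class PrimaryProcess:
    """Applies requests and checkpoints its state every few requests."""

    def __init__(self, backup, checkpoint_every=2):
        self.state = {"processed": 0, "total": 0}
        self.backup = backup
        self.checkpoint_every = checkpoint_every
        self.backup.receive_checkpoint(self.state)  # initial checkpoint

    def handle(self, amount):
        self.state["processed"] += 1
        self.state["total"] += amount
        if self.state["processed"] % self.checkpoint_every == 0:
            self.backup.receive_checkpoint(self.state)


def run_with_failure(requests, fail_after, checkpoint_every=2):
    """Process requests; the primary 'fails' after fail_after requests and
    the backup resumes from its last snapshot, replaying what was lost."""
    backup = BackupProcess()
    primary = PrimaryProcess(backup, checkpoint_every)
    for amount in requests[:fail_after]:
        primary.handle(amount)
    # Primary fails here; any work done since the last checkpoint is lost.
    state = backup.take_over()
    for amount in requests[state["processed"]:]:
        state["processed"] += 1
        state["total"] += amount
    return state
```

In this sketch the only cost between failures is the periodic state copy, which mirrors the point made above that checkpointing adds far less overhead than fully duplicated execution.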
Tandem's initial database support was only for hierarchical, non-relational databases via the Enscribe file system. This was extended into a relational database called ENCOMPASS. In 1986 Tandem introduced the first fault-tolerant SQL database, NonStop SQL. Developed totally in-house, NonStop SQL includes a number of features based on Guardian to ensure data validity across nodes. NonStop SQL is known for scalability in speedup with the number of nodes added to the system, whereas most databases had performance that plateaued quite quickly, often after just two CPUs. A later version released in 1989 added transactions that could be spread over nodes, a feature that remained unique for some time. NonStop SQL continued to evolve, first as NonStop SQL/MP and then NonStop SQL/MX, which transitioned from Tandem to Compaq to HP. The code remains in use in HP's NonStop SQL/MP and NonStop SQL/MX and in the Apache Trafodion project.
Like Tandem's prior high-end machines, Cyclone cabinets were styled with much angular black to suggest strength and power. Advertising videos directly compared Cyclone to the Lockheed SR-71 Blackbird Mach 3 spy plane. Cyclone's name was supposed to represent its "unstoppable speed in roaring through OLTP workloads". Announcement day was October 17, 1989. That afternoon, the region was struck by the magnitude 6.9 Loma Prieta earthquake, causing freeway collapses in Oakland and major fires in San Francisco. Tandem offices were shaken, but no one was badly hurt on site.
Development of Rainbow's advanced client/server application development framework called "Crystal" continued awhile longer and was spun off as the "Ellipse" product of Cooperative Systems Incorporated. ("Exec details firm's net-based OLTP tools", Network World, March 16, 1992.)
In 1986, the company introduced the 6AT, an IBM PC AT-compatible computer. Tandem sold the 6AT only to existing customers; "we are not going to go out and innovate", it said.
In such systems, the spare processors do not contribute to system throughput between failures, but merely redundantly execute exactly the same data thread as the active processor at the same instant, in "lock step". Faults are detected by seeing when the cloned processors' outputs diverge. To detect failures, the system must have two physical processors for each logical, active processor. To also implement automatic failover recovery, the system must have three or four physical processors for each logical processor. The triple or quadruple cost of this sparing is practical when the duplicated parts are commodity single-chip microprocessors.
Tandem's products for this market began with the Integrity line in 1989, using MIPS processors and a "NonStop UX" variant of Unix. It was developed in Austin, Texas. In 1991, the Integrity S2 used TMR, Triple Modular Redundancy, where each logical CPU used three MIPS R2000 microprocessors to execute the same data thread, with voting to find and lock out a failed part. Their fast clocks could not be synchronized as in strict lock stepping, so voting instead happened at each interrupt. Some other versions of Integrity used 4x "pair and spares" redundancy. Pairs of processors ran in lock-step to check each other. When they disagreed, both processors were marked untrusted, and their workload was taken over by a hot-spare pair of processors whose state was already current. In 1995, the Integrity S4000 was the first to use ServerNet (a networked "bus" structure) and moved toward sharing peripherals with the NonStop line.
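The two redundancy schemes described above, triple modular redundancy with voting and "pair and spare", can be sketched as simple comparison functions. This is an illustrative model under stated assumptions, not Tandem's hardware logic: the function names are hypothetical, and each processor's result at a comparison point is reduced to a single comparable value.

```python
from collections import Counter


def tmr_vote(outputs):
    """Majority vote over three redundant outputs.

    Returns (agreed_value, faulty_indices). Raises if no two units agree.
    """
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three units disagree")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


def pair_and_spare(active_pair, spare_pair):
    """Each pair is (output_a, output_b) from two lock-stepped processors.

    Returns (value, used_spare). A disagreeing pair cannot tell which member
    is wrong, so the whole pair is distrusted and the spare pair takes over.
    """
    a, b = active_pair
    if a == b:
        return a, False
    s1, s2 = spare_pair
    if s1 == s2:
        return s1, True
    raise RuntimeError("both pairs disagree internally")
```

The design difference shows in the return values: voting identifies the single dissenting unit so it can be locked out, whereas a mismatched pair only signals that something is wrong and relies on the hot spare for recovery.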
The R3000 and later microprocessors had only a typical amount of internal error checking, insufficient for Tandem's needs. So, the Cyclone/R ran pairs of R3000 processors in lock step, running the same data thread. This was for purposes of data integrity, not fault tolerance; fault tolerance was handled by the other mechanisms still in place. It used a variation of lock stepping in which the checker processor ran 1 cycle behind the primary processor. This allowed them to share a single copy of external code and data caches without putting excessive pinout load on the system bus and lowering the system clock rate. To successfully run microprocessors in lock step, the chips must be designed to be fully deterministic. Any hidden internal state must be cleared by the chip's reset mechanism. Otherwise, the matched chips can go out of sync for no visible reason and without any faults, long after the chips are restarted. Chip designers agree that these are good principles because they help them test chips at manufacturing time. But all new microprocessor chips seemed to have bugs in this area and required months of shared work between MIPS (the third-party manufacturer used by Tandem) and Tandem to eliminate or work around the final subtle bugs.
All S-Series machines used MIPS processors, including the R4400, R10000, R12000, and R14000.
The design of the later, faster MIPS cores was primarily funded by Silicon Graphics. But Intel's sixth-generation Pentium Pro overtook the performance of RISC designs, and SGI's graphics business shrank. After the R10000, there was no investment in significant new MIPS core designs for high-end servers. So Tandem needed to move its NonStop product line to another microprocessor architecture with competitive fast chips.
Compaq's x86-based server division was an early outside adopter of Tandem's ServerNet/InfiniBand interconnect technology. In 1997, Compaq acquired the Tandem Computers company and NonStop customer base to balance Compaq's heavy focus on personal computers (PCs). In 1998, Compaq also acquired the much larger Digital Equipment Corporation and inherited its DEC Alpha RISC servers with OpenVMS and Tru64 Unix customer bases. Tandem was then midway in porting its NonStop product line from MIPS R12000 microprocessors to Intel's new Itanium Merced microprocessors. This project was restarted with Alpha as the new target to align NonStop with Compaq's other large server lines. But in 2001, Compaq terminated all Alpha engineering investments in favor of the Itanium microprocessors, before any new NonStop products were released on Alpha.
In some ways, Tandem's journey from HP-inspired start-up to an HP-inspired competitor, then to an HP division was "bringing Tandem back to its original roots", but this was not the same HP.
The porting of the NSK-based NonStop product line from MIPS processors to Itanium-based processors was completed and was branded as "HP Integrity NonStop Servers". (This NSK Integrity NonStop was unrelated to Tandem's original "Integrity" series for Unix.)
Because it was not possible to run Itanium McKinley chips with clock-level lock stepping, the Integrity NonStop machines instead lock stepped using comparisons between chip states at longer time scales, at interrupt points and at various software synchronization points in between interrupts. The intermediate synchronization points were automatically triggered at every n'th taken branch instruction and were also explicitly inserted into long loop bodies by all NonStop compilers. The machine design supported both dual and triple redundancy, with either two or three physical microprocessors per logical Itanium processor. The triple version was sold to customers needing the utmost reliability. This new checking approach was called NSAA, NonStop Advanced Architecture.
As in the earlier migration from stack machines to MIPS microprocessors, all customer software was carried forward without source changes. "Native mode" source code compiled directly to MIPS machine code was simply recompiled for Itanium. Some older "non-native" software was still in TNS stack machine form. These were automatically ported onto Itanium via object code translation techniques.
The inclusion of the fault-tolerant 4X FDR (Fourteen Data Rate) InfiniBand double-wide switches provided a more than 25-fold increase in system interconnect capacity.